Goto

Collaborating Authors

 medical domain



Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning

Zhou, Yuxuan, Wang, Yubin, Wang, Bin, Ning, Chen, Liu, Xien, Wu, Ji, Hao, Jianye

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness, i.e., the ability to recognize missing or critical details (e.g., user identity, medical history, risk factors) and provide safe, helpful, and contextually appropriate responses. To address this issue, we propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets (decision-making, communication, and safety) through self-evaluation and refinement. Specifically, we first design a attribute-conditioned query generator that simulates diverse real-world user contexts by varying attributes such as role, geographic region, intent, and degree of information ambiguity. An LLM then responds to these queries, self-evaluates its answers along three key facets, and refines its responses to better align with the requirements of each facet. Finally, the queries and refined responses are used for supervised fine-tuning to reinforce the model's context-awareness ability. Evaluation results on the latest HealthBench dataset demonstrate that our method significantly improves LLM performance across multiple aspects, with particularly notable gains in the context-awareness axis. Furthermore, by incorporating knowledge distillation with the proposed method, the performance of a smaller backbone LLM (e.g., Qwen3-32B) surpasses its teacher model, achieving a new SOTA across all open-source LLMs on HealthBench (63.8%) and its hard subset (43.1%). Code and dataset will be released at https://muser-llm.github.io.



CBP-Tuning: Efficient Local Customization for Black-box Large Language Models

Zhao, Jiaxuan, Gu, Naibin, Feng, Yuchen, Liu, Xiyu, Fu, Peng, Lin, Zheng, Wang, Weiping

arXiv.org Artificial Intelligence

The high costs of customizing large language models (LLMs) fundamentally limit their adaptability to user-specific needs. Consequently, LLMs are increasingly offered as cloud-based services, a paradigm that introduces critical limitations: providers struggle to support personalized customization at scale, while users face privacy risks when exposing sensitive data. To address this dual challenge, we propose Customized Black-box Prompt Tuning (CBP-Tuning), a novel framework that facilitates efficient local customization while preserving bidirectional privacy. Specifically, we design a two-stage framework: (1) a prompt generator trained on the server-side to capture domain-specific and task-agnostic capabilities, and (2) user-side gradient-free optimization that tailors soft prompts for individual tasks. This approach eliminates the need for users to access model weights or upload private data, requiring only a single customized vector per task while achieving effective adaptation. Furthermore, the evaluation of CBP-Tuning in the commonsense reasoning, medical and financial domain settings demonstrates superior performance compared to baselines, showcasing its advantages in task-agnostic processing and privacy preservation.


MedResearcher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework

Yu, Ailing, Yao, Lan, Liu, Jingnan, Chen, Zhe, Yin, Jiajun, Wang, Yuan, Liao, Xinhao, Ye, Zhiling, Li, Ji, Yue, Yun, Xiao, Hansong, Zhou, Hualei, Guo, Chunxiao, Wei, Peng, Liu, Junwei, Gu, Jinjie

arXiv.org Artificial Intelligence

Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as evidenced by leading proprietary systems achieving limited accuracy on complex medical benchmarks. The key limitations are: (1) the model lacks sufficient dense medical knowledge for clinical reasoning, and (2) the framework is constrained by the absence of specialized retrieval tools tailored for medical contexts. We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs, extracting the longest chains from subgraphs around rare medical entities to generate complex multi-hop question-answer pairs. Second, we integrate a custom-built private medical retrieval engine alongside general-purpose tools, enabling accurate medical information synthesis. Our approach generates 2100+ diverse trajectories across 12 medical specialties, each averaging 4.2 tool interactions. Through a two-stage training paradigm combining supervised fine-tuning and online reinforcement learning with composite rewards, our MedResearcher-R1-32B model demonstrates exceptional performance, establishing new state-of-the-art results on medical benchmarks while maintaining competitive performance on general deep research tasks. Our work demonstrates that strategic domain-specific innovations in architecture, tool design, and training data construction can enable smaller open-source models to outperform much larger proprietary systems in specialized domains.


Towards Efficient Prompt-based Continual Learning in Distributed Medical AI

Oh, Gyutae, Shin, Jitae

arXiv.org Artificial Intelligence

Modern AI models achieve state-of-the-art performance with large-scale, high-quality datasets; however, ethical, social, and institutional constraints in the medical domain severely restrict data sharing, rendering centralized learning nearly impossible. Each institution must incrementally update models using only local data. Traditional training overfits new samples and suffers from catastrophic forgetting, losing previously acquired knowledge. Medical data distributions also shift due to varying diagnostic equipment and demographics. Although continual learning (CL) has advanced, most methods address natural images, leaving medical-domain-specific CL underexplored. We propose a prompt-based continual learning (PCL) approach featuring a unified prompt pool with a minimal expansion strategy: by expanding and freezing a subset of prompts, our method reduces computational overhead, and a novel regularization term balances retention and adaptation. Experiments on three diabetic retinopathy datasets Aptos2019, LI2019, and Diabetic Retinopathy Detection show our model improves final classification accuracy by at least 10% and F1-score by 9 points over state-of-the-art approaches while lowering inference cost. We anticipate this study will drive sustainable medical AI advances, enabling real-time diagnosis, patient monitoring, and telemedicine applications in distributed healthcare. Code will be released upon acceptance


Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training

Ostapenko, Oleksiy, Guille-Escuret, Charles, Kumar, Luke, Tian, Max, Kocetkov, Denis, Subbaraj, Gopeshh, Li, Raymond, Lamy-Poirier, Joel, Paquet, Sebastien, Scholak, Torsten

arXiv.org Artificial Intelligence

We introduce a framework for optimizing domain-specific dataset construction in foundation model training. Specifically, we seek a cost-efficient way to estimate the quality of data sources (e.g. synthetically generated or filtered web data, etc.) in order to make optimal decisions about resource allocation for data sourcing from these sources for the stage two pre-training phase, aka annealing, with the goal of specializing a generalist pre-trained model to specific domains. Our approach extends the usual point estimate approaches, aka micro-annealing, to estimating scaling laws by performing multiple annealing runs of varying compute spent on data curation and training. This addresses a key limitation in prior work, where reliance on point estimates for data scaling decisions can be misleading due to the lack of rank invariance across compute scales -- a phenomenon we confirm in our experiments. By systematically analyzing performance gains relative to acquisition costs, we find that scaling curves can be estimated for different data sources. Such scaling laws can inform cost effective resource allocation across different data acquisition methods (e.g. synthetic data), data sources (e.g. user or web data) and available compute resources. We validate our approach through experiments on a pre-trained model with 7 billion parameters. We adapt it to: a domain well-represented in the pre-training data -- the medical domain, and a domain underrepresented in the pretraining corpora -- the math domain. We show that one can efficiently estimate the scaling behaviors of a data source by running multiple annealing runs, which can lead to different conclusions, had one used point estimates using the usual micro-annealing technique instead. This enables data-driven decision-making for selecting and optimizing data sources.


Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights

Mishra, Deepali, Silpasuwanchai, Chaklam, Modi, Ashutosh, Sushil, Madhumita, Chumnanvej, Sorayouth

arXiv.org Artificial Intelligence

Medical Visual Question Answering (MedVQA) is a promising tool to assist radiologists by automating medical image interpretation through question answering. Despite advances in models and datasets, MedVQA's integration into clinical workflows remains limited. This study systematically reviews 68 publications (2018-2024) and surveys 50 clinicians from India and Thailand to examine MedVQA's practical utility, challenges, and gaps. Following the Arksey and O'Malley scoping review framework, we used a two-pronged approach: (1) reviewing studies to identify key concepts, advancements, and research gaps in radiology workflows, and (2) surveying clinicians to capture their perspectives on MedVQA's clinical relevance. Our review reveals that nearly 60% of QA pairs are non-diagnostic and lack clinical relevance. Most datasets and models do not support multi-view, multi-resolution imaging, EHR integration, or domain knowledge, features essential for clinical diagnosis. Furthermore, there is a clear mismatch between current evaluation metrics and clinical needs. The clinician survey confirms this disconnect: only 29.8% consider MedVQA systems highly useful. Key concerns include the absence of patient history or domain knowledge (87.2%), preference for manually curated datasets (51.1%), and the need for multi-view image support (78.7%). Additionally, 66% favor models focused on specific anatomical regions, and 89.4% prefer dialogue-based interactive systems. While MedVQA shows strong potential, challenges such as limited multimodal analysis, lack of patient context, and misaligned evaluation approaches must be addressed for effective clinical integration.


Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs

Oh, Gyutaek, Kim, Seoyeon, Park, Sangjoon, Kim, Byung-Hoon

arXiv.org Artificial Intelligence

Extending test-time scaling to medicine, m1 [34] adapts s1's methodology using small datasets with reasoning traces and thinking token budgets, enabling lightweight models under 10B parameters to achieve state-of-the-art medical reasoning with a 4K token budget. For comprehensive medical understanding, multimodal emerged. HuatuoGPT -Vision [37] integrates visual and textual medical knowledge as a 34B multimodal LLM trained on 1.3 million medical visual question answering (VQA) samples. MedGemma [38] is a Google's open-source medical AI collection combining multimodal capabilities, available in 4B multimodal built on Gemma 3 architecture. Beyond architecture, reinforcement learning has been increasingly leveraged for test-time scaling and improved model robustness in vision-language medical models. MedVLM-R1 [33] uses reinforcement learning to generate natural language reasoning alongside answers. Med-R1 [32] employs group relative policy optimization (GRPO) [23] to improve gener-alizability across various medical imaging modalities, achieving 29.94% accuracy improvement. Both models demonstrate reinforcement learning effectiveness in medical AI, with MedVLM-R1 focusing on reasoning transparency and Med-R1 emphasizing cross-modality generalization.


Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains

Wu, Juncheng, Liu, Sheng, Tu, Haoqin, Yu, Hang, Huang, Xiaoke, Zou, James, Xie, Cihang, Zhou, Yuyin

arXiv.org Artificial Intelligence

Recent advances in reasoning-enhanced Large Language Models such as OpenAI-o1/3 and DeepSeek-R1 have significantly improved performance on complex tasks. However, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond the final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing the thinking trajectories into two parts: knowledge and reasoning. Specifically, we introduce a fine-grained evaluation framework that judges: (1) the correctness of knowledge used (measured by Knowledge Index (KI)) and (2) the quality of reasoning (measured by Information Gain (InfoGain)). Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains. Three intriguing findings emerge: (1) The general reasoning abilities in R1-distilled models do not transfer effectively to the medical domain through either SFT or RL. (2) SFT raises final-answer accuracy in both domains, but often at the cost of reasoning quality: InfoGain drops by 38.9% on average compared with untrained models; In the medical domain, however, SFT remains crucial because domain knowledge is indispensable. (3) RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, thereby improving both reasoning accuracy and knowledge correctness.